First we need to set up the environment and load the packages we will use for this workshop.
library(Seurat): Loads the Seurat package, which is a comprehensive toolkit for single-cell RNA sequencing and spatial transcriptomics data analysis. It provides a wide range of functions for data preprocessing, normalization, clustering, dimensionality reduction, and visualization. Explore documentation here: https://satijalab.org/seurat/
library(ggplot2): Loads the ggplot2 package, a powerful and flexible system for creating static visualizations in R. Explore documentation here: https://ggplot2.tidyverse.org/
library(scCustomize): Loads the scCustomize package, which provides custom functions and themes to enhance the visualization and analysis capabilities of single-cell and spatial transcriptomics data, often in conjunction with Seurat. Explore documentation here: https://samuel-marsh.github.io/scCustomize/
library(readr): Loads readr package for fast and friendly reading of rectangular data, such as CSV files, into R.
library(pheatmap): Loads pheatmap package, which is for creating pretty heatmaps, offering better control over heatmap customization compared to base R.
library(matrixStats): matrixStats provides highly optimized functions for matrix operations, particularly useful for computing row and column summaries.
library(spdep): spdep stands for Spatial Dependence and Spatial Autocorrelation, and it provides functions for spatial data analysis, including spatial weights generation, spatial autocorrelation statistics, and spatial regression.
library(geojsonR) The geojsonR library is used for handling GeoJSON data in R. GeoJSON is a format for encoding a variety of geographic data structures using JavaScript Object Notation (JSON). It is sometimes used as a format for storing cell segmentation boundaries.
library(Seurat)
library(ggplot2)
library(scCustomize)
library(readr)
library(pheatmap)
library(matrixStats)
library(spdep)
Loading required package: spData
To access larger datasets in this package, install the spDataLarge package with: `install.packages('spDataLarge',
repos='https://nowosad.github.io/drat/', type='source')`
Loading required package: sf
Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(geojsonR)
Sets the path to the directory containing the Xenium output data - this is the directory where all of the outputs are stored.
data_dir <- "/project/shared/spatial_data_camp/datasets/DATASET2/XENIUM_COLON_SUBSET"
ReadXenium reads Xenium spatial transcriptomics data from a specified directory using a Seurat wrapper function that supports this data format. Xenium data typically includes expression matrices and spatial coordinates, along with other information about cell centroids and segmentations and coordinates of individual transcripts.
data_dir: The path to the directory containing the Xenium data, set in the previous step. outs = c(“matrix”, “microns”): Specifies the outputs to read from the data directory. matrix refers to summarised cell by gene matrix and microns refers to individual transcript coordinates.
type = c(“centroids”, “segmentations”): Indicates the types of spatial information to include - here, we are reading ib both cell centroid coordinates and cell boundary segmentations.
data <- ReadXenium(data_dir, outs = c("matrix", "microns"), type=c("centroids", "segmentations"))
10X data contains more than one type and is being returned as a list containing matrices of each type.
This provides us a list of data:
names(data)
[1] "matrix" "microns" "centroids" "segmentations"
Matrix is further split into gene expression matrix and various control probes and codewords. Different platforms and platform versions include different control probes. As this will vary, it’s important to check and understand what the specific controls in your own data are.
Here, negative control probes are probes that are added to the reaction but target non-biological sequences and should not bind any tissue RNA. Negative control codewords are valid codewords, but no probes with that codeword added to the reaction. This effectively tells us how good the transcript calling algorithm is.
names(data$matrix)
[1] "Gene Expression" "Negative Control Codeword" "Negative Control Probe" "Unassigned Codeword"
Read in additional information about the cells - this gives us pre-calculated information, for example segmented cell or nucleus size for each cell.
cell_meta_data <- read.csv(file.path(data_dir, "cells.csv.gz"))
rownames(cell_meta_data) <- cell_meta_data$cell_id
head(cell_meta_data)
We will start by creating a basic seurat object from the data.
CreateSeuratObject function initializes a Seurat object using the provided gene expression matrix and optional metadata.
counts: The gene expression matrix, which contains the raw count data for each gene in each cell. data$matrix[[“Gene Expression”]]: Specifies the gene expression matrix extracted from the loaded Xenium data. Here, we leave out the control probes for now.
assay: The name of the assay - you can call it anything you like. Here, we go with “XENIUM”.
meta.data: Metadata associated with the cells or spots. Here, we add the cell statistics we read in earlier as cell_meta_data.
By printing the seurat object, we can see that we read in ~ 30,000 cells with measures for 325 genes
seurat <- CreateSeuratObject(counts = data$matrix[["Gene Expression"]],
assay = "XENIUM",
meta.data = cell_meta_data)
seurat
An object of class Seurat
325 features across 32872 samples within 1 assay
Active assay: XENIUM (325 features, 0 variable features)
1 layer present: counts
Adding spatial coordinates to a Seurat object allows for spatially resolved analysis and visualization. This requires creating objects for centroids and segmentations we read in earlier, and then integrating these with the main Seurat object.
CreateFOV: This function creates a field of view (FOV) object that includes spatial information about the centroids, segmentations, and molecule coordinates. An FOV can be the entire slide, or a selected region within a slide - i.e. it does not need to have entries for all the cells in the seurat object.
coords: A list containing the centroids and/or segmentation data. For larger datasets, it can be quicker to only load centroids, as this minimises the amount of data points.
centroids = CreateCentroids(data\(centroids)*: Creates a centroids object from the centroid data in the Xenium dataset. *segmentation = CreateSegmentation(data\)segmentations): Creates a segmentation object from the segmentation data in the Xenium dataset.
type = c(“segmentation”, “centroids”): Specifies the types of spatial data being included, which are segmentation and centroid data.
molecules = data$microns: The spatial coordinates of individual transcripts/molecules in the data. This is optional - for larger datasets, skipping transcript coordinates can be a good idea.
seurat[[“COLON”]] <- coords: Adds the created FOV object to the Seurat object under the new FOV name “COLON”. This can be named (almost) anything - but, avoid using underscores as this can create some unexpected behaviours later.
TIP: LoadXenium() is a wrapper that would load in both cell counts matrix and spatial coordinates in one function, simplifying these steps. However, in situ platforms are evolving at a very fast rate and there are constant changes on how the data is stored, in particular for file formats for cell segmentation and coordinates. Here, we have broken down the steps to show how to assemble an in situ seurat object from the key components, in case the platform specific readers don’t work for your specific data.
coords <- CreateFOV(coords = list(centroids = CreateCentroids(data$centroids),
segmentation = CreateSegmentation(data$segmentations)),
type = c("segmentation", "centroids"),
molecules = data$microns,
assay = "XENIUM")
seurat[["COLON"]] <- coords
Inspect the object - now, you can see we have added a spatial field of view:
seurat
An object of class Seurat
325 features across 32872 samples within 1 assay
Active assay: XENIUM (325 features, 0 variable features)
1 layer present: counts
1 spatial field of view present: COLON
Adding control probes and codewords as separate assays in the Seurat object allows for the tracking and analysis of technical artifacts and noise within your spatial transcriptomics data, while keeping these outputs separate from the main biological gene expression values.
Unassigned codewords are unused codewords. There is no probe in a particular gene panel that will generate the codeword.
Negative control probes are probes that exist in the panels but target non-biological sequences. They can be used to assess the specificity of the assay.
Negative control codewords are codewords in the codebook that do not have any probes matching that code. They are chosen to meet the same requirements as regular codewords and can be used to assess the specificity of the decoding algorithm.
seurat[["Negative.Control.Codeword"]] <- CreateAssayObject(counts = data$matrix[["Negative Control Codeword"]])
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
seurat[["Negative.Control.Probe"]] <- CreateAssayObject(counts = data$matrix[["Negative Control Probe"]])
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
seurat[["Unassigned.Codeword"]] <- CreateAssayObject(counts = data$matrix[["Unassigned Codeword"]])
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
Inspect the object:
seurat
An object of class Seurat
541 features across 32872 samples within 4 assays
Active assay: XENIUM (325 features, 0 variable features)
1 layer present: counts
3 other assays present: Negative.Control.Codeword, Negative.Control.Probe, Unassigned.Codeword
1 spatial field of view present: COLON
Let’s start with some basic QC and visualisation of the data.
In Seurat, in situ spatial transcriptomics counterpart functions to ‘SpatialDimPlot’ and ‘SpatialFeaturePlot’ we covered yesterday are called ‘ImageFeaturePlot’ and ‘ImageDimPlot’. These have additional functionality to plot cell segmentations and individual transcript coordinates, but otherwise function exactly the same as the sequencing based ST counterparts.
First, lets visualise the total transcripts detected per cell.
As in scRNA-Seq data, this is the most basic measure of overall signal and how well the data looks.
Unlike in scRNA-Seq data or unbiased sequencing-based ST, these measures are also very heavily dependent not only on the total RNA quantity of each cell and tissue quality, but also on the target panel used for the experiment. Under-represented cell types will naturally yield fewer transcripts. Finally, the quality of cell segmentation also plays a role.
In this case, we can see that there are areas with higher and lower total transcripts detected.
Understanding your tissue and target panel here is important to delineate where these differences are biological and where they may be technical.
ImageFeaturePlot(seurat, "nCount_XENIUM") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Similarly, we can visualise the total number of gene detected per cell. You can see that this is a bit less variable across tissue.
This can also suggest that there cells at the top of the epithelial crypts in this sample with genes detected at high copy number than the rest of the tissue.
ImageFeaturePlot(seurat, "nFeature_XENIUM") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
This code examines the distribution of the number of features (genes) detected per cell in the Seurat object using a density plot and calculates specific quantiles of this distribution. This is important for understanding the variability and distribution of detected features, which can help identify potential issues such as low-quality cells and determine any filtering thresholds that may need to be applied.
If you’re coming from scRNA-Seq work, these low numbers probably look very alarming. How can you possibly work with 31 median genes per cell?
Unlike scRNA-Seq data and sequencing-based ST, both gene dropouts and noise are much, much lower in in situ ST data.
We are also working with 100-fold fewer targetted genes.
ggplot(seurat[[]], aes(nFeature_XENIUM)) + geom_density()
quantile(seurat$nFeature_XENIUM, c(0.01, 0.1, 0.5, 0.9, 0.99))
1% 10% 50% 90% 99%
5 13 29 56 78
Using ImageFeaturePlot to visualize the cell area in spatial transcriptomics data allows us to examine the spatial organization and potential heterogeneity of cell sizes within your tissue sample.
Why do we get such a difference in spatial distribution of cell sizes?
This could be due to biological differences between small and large cells - e.g. small cells like T-cells.
However, here the signal correlates with areas of low cellularisation. Therefore, it is likely this is an artefact of nuclei expansion in cell segmentation.
What is Nuclei Expansion?
Nuclei expansion in cell segmentation refers to the process of enlarging the segmented nuclei regions to approximate the boundaries of the entire cells. This technique is used to better represent the actual cell boundaries when only the nuclei have been explicitly segmented/we only have DAPI and no additional cell boundary staining. The primary goal is to provide a more accurate estimation of the cellular area, which is crucial for various downstream analyses in spatial transcriptomics and single-cell studies. In this case, nuclei expansion is constrained either by maximum distance or other nearby cells - so, where there are no other nearby cells to “bump into”, the expansion generates artificially bigger cells.
ImageFeaturePlot(seurat, "cell_area") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
We can further check that this is likely the case by plotting the ratio between nuclei and total cell area. We can see that there is a very big decrease in percentage of cell area occupied by nucleus in areas of low cell density.
The cell-to-nucleus area ratio can also potentially provide insights into cell morphology, cell type and potential changes in cellular states or conditions. For example, T-Cells can often be quite well identified by this variable alone, as they have a small cytoplasm volume. However, without a cell boundary stain, this metric mainly captures segmentation artefacts, so be careful about over-interpretation!
seurat$cell_nucleus_ratio <- seurat$nucleus_area / seurat$cell_area
ImageFeaturePlot(seurat, "cell_nucleus_ratio") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
If we look at the distribution, we see that we have a big tail end of overly large cells.
ggplot(seurat[[]], aes(cell_area)) + geom_density()
In this case, we can see that as expected, there is generally a correlation between cell area and transcript detection rate.
However, we also have a group of cells where this is not the case - very large cells but relatively few transcripts. These cells are mainly submucosal stromal cells which are very poorly covered by the panel 10x have used.
ggplot(seurat[[]], aes(nCount_XENIUM, cell_area)) + geom_point()
We can create a filter to remove the overly large cells from the analysis.
quantile(seurat$cell_area, 0.99): Calculates the 99th percentile of the cell_area values in the Seurat object. This value serves as a threshold to identify the largest 1% of cells - but what is a sensible threshold, if any, depends on your tissue.
seurat\(cell_area < quantile(seurat\)cell_area, 0.99): Compares each cell’s area to the 99th percentile threshold. The result is a logical vector where each element is TRUE if the corresponding cell’s area is less than the 99th percentile and FALSE otherwise.
seurat[[“SIZE_FILTER_LARGE”]]: Creates a new metadata field named SIZE_FILTER_LARGE in the Seurat object, storing the logical vector.
seurat[["SIZE_FILTER_LARGE"]] <- seurat$cell_area < quantile(seurat$cell_area, .99)
Now we can use ImageDimPlot to visualise the cells which have been flagged for removal.
We can see that these are mostly in the submucosa region.
How do different thresholds behave? Is there a more appropriate one to use? Is any necessary at all?
ImageDimPlot(seurat, group.by="SIZE_FILTER_LARGE")
We can use the same approach to create a filter for segmented cells which are very small and likely segmentation arfetacts.
quantile(seurat$cell_area, 0.01): Calculates the 1st percentile of the cell_area values in the Seurat object. This value serves as a threshold to identify the smallest 1% of cells.
seurat\(cell_area > quantile(seurat\)cell_area, 0.01): Compares each cell’s area to the 1st percentile threshold. The result is a logical vector where each element is TRUE if the corresponding cell’s area is greater than the 1st percentile and FALSE otherwise.
seurat[[“SIZE_FILTER_SMALL”]]: Creates a new metadata field named SIZE_FILTER_SMALL in the Seurat object, storing the logical vector.
seurat[["SIZE_FILTER_SMALL"]] <- seurat$cell_area > quantile(seurat$cell_area, .01)
Now we can use ImageDimPlot to visualise the cells which have been flagged for removal.
We can see that these are more scattered throughout the tissue - but there may be more in the follicular regions.
How do different thresholds behave? Is there a more appropriate one to use? Is any necessary at all?
ImageDimPlot(seurat, group.by="SIZE_FILTER_SMALL")
We can check how these values correlate with gene detection rate.
If we filter out small cells, we will remove cells with low numbers of genes detected.
If we filter out large cells, this is not that biased towards overly large counts, as we saw before.
p1 <- VlnPlot(seurat, "nFeature_XENIUM", group.by = "SIZE_FILTER_SMALL", pt.size = .1, alpha = .5) + labs(title="Small Cell Filter")
Warning: Default search for "data" layer in "XENIUM" assay yielded no results; utilizing "counts" layer instead.
p2 <- VlnPlot(seurat, "nFeature_XENIUM", group.by = "SIZE_FILTER_LARGE", pt.size = .1, alpha = .5)+ labs(title="Large Cell Filter")
Warning: Default search for "data" layer in "XENIUM" assay yielded no results; utilizing "counts" layer instead.
p1 + p2
Adjusting the threshold for what is considered a “small cell” can have significant implications for your analysis, especially in areas with specific cell types such as T-cells, which are small and densely packed in follicular regions. This example demonstrates how changing the threshold to the 10th percentile affects the filtering. In this case, we would probably filter out a lot of good cells that we don’t want to lose! So, be careful when looking at these types of QC metrics!
seurat[["SIZE_FILTER_SMALL"]] <- seurat$cell_area > quantile(seurat$cell_area, .1)
ImageDimPlot(seurat, group.by="SIZE_FILTER_SMALL")
Lets set this back to the original 1% threshold.
seurat[["SIZE_FILTER_SMALL"]] <- seurat$cell_area > quantile(seurat$cell_area, .01)
The most important filter is the overall transcript detection. Empty cells or cells with very low transcript count cannot be taken forward for clustering analysis and it is extremely difficult to identify what they may be. Here, we set a threshold of minimum 15 transcripts. This seems quite low - for data from in situ platforms with low noise (Xenium, Merfish, Merscope), this is generally enough to cluster and identify cell types. If your data has more noise (e.g. CosMx), a higher threshold is more appropriate.
seurat\(nCount_XENIUM >= 15*: Compares each cell's transcript count to the threshold of 15. The result is a logical vector where each element is TRUE if the corresponding cell has at least 15 transcripts and FALSE otherwise. *seurat\)TRANSCRIPT_FILTER: Creates a new metadata field named TRANSCRIPT_FILTER in the Seurat object, storing the logical vector.
seurat$TRANSCRIPT_FILTER <- seurat$nCount_XENIUM >= 15
And we can visualise the cells that we would lose.
We see that we disproportionately would filter out more cells from some regions than others. As pointed out previously, this is likely due to a combination of gene panel coverage in some regions and very small cells in densely packed regions like follicles.
ImageDimPlot(seurat, group.by="TRANSCRIPT_FILTER")
Finally, visualizing the counts of negative control codewords, negative control probes, and unassigned codewords helps identify and understand technical artifacts and background noise in your spatial transcriptomics data.
Here, we can see that all control probes and codewords produce yield very little signal, suggesting our data is good quality!
In some cases, high amount of autoflourescence is the cells/tissue can sometimes generate false positive signal and this should be filtered out.
ImageFeaturePlot(seurat, "nCount_Negative.Control.Codeword") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
ImageFeaturePlot(seurat, "nCount_Negative.Control.Probe") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
ImageFeaturePlot(seurat, "nCount_Unassigned.Codeword") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Although the negative control signal is low, we can nonetheless create a filter to remove cells which have any, although in this case it is probably unnecessary.
seurat$PROBE_FILTER <- seurat$nCount_Unassigned.Codeword == 0 &
seurat$nCount_Negative.Control.Codeword == 0 &
seurat$nCount_Negative.Control.Probe == 0
ImageDimPlot(seurat, group.by="PROBE_FILTER")
Finally, we can subset the seurat object based on any/all of the filters we have created earlier.
By combining probe, size, and transcript filters, you can retain only the cells that meet all quality criteria, reducing the impact of technical artifacts and noise on your analysis.
seurat <- subset(seurat, PROBE_FILTER & SIZE_FILTER_LARGE & SIZE_FILTER_SMALL & TRANSCRIPT_FILTER)
Warning: Not validating FOV objectsWarning: Not validating Centroids objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating Seurat objects
Lets examine the cleaned up object - we have lost a few thousand cells from the analysis.
seurat
An object of class Seurat
541 features across 29336 samples within 4 assays
Active assay: XENIUM (325 features, 0 variable features)
1 layer present: counts
3 other assays present: Negative.Control.Codeword, Negative.Control.Probe, Unassigned.Codeword
1 spatial field of view present: COLON
Data Normalisation
The SCTransform function in Seurat is used for normalizing single-cell RNA-seq and spatial transcriptomics data. This method models the gene expression counts using a regularized negative binomial regression and removes technical noise while preserving biological variability. The clip.range parameter is used to limit the range of the transformed values, which can help stabilize downstream analyses by limiting the influence of extreme values.
seurat <- SCTransform(seurat, assay = "XENIUM", clip.range = c(-10, 10))
Running SCTransform on assay: XENIUM
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Variance stabilizing transformation of count matrix of size 325 by 29336
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 322 genes, 5000 cells
Found 4 outliers - those will be ignored in fitting/regularization step
Second step: Get residuals using fitted parameters for 325 genes
Computing corrected count matrix for 325 genes
Calculating gene attributes
Wall clock passed: Time difference of 4.586115 secs
Determine variable features
Centering data matrix
|
| | 0%
|
|===============================================================================================================================================| 100%
Getting residuals for block 1(of 6) for counts dataset
Getting residuals for block 2(of 6) for counts dataset
Getting residuals for block 3(of 6) for counts dataset
Getting residuals for block 4(of 6) for counts dataset
Getting residuals for block 5(of 6) for counts dataset
Getting residuals for block 6(of 6) for counts dataset
Centering data matrix
|
| | 0%
|
|===============================================================================================================================================| 100%
Finished calculating residuals for counts
Set default assay to SCT
Principal Component Analysis (PCA) is a dimensionality reduction technique used to identify the primary axes of variation in high-dimensional data. In the context of spatial transcriptomics, PCA helps to reduce the complexity of the data while preserving the most important patterns of variation.
TIP: If your target panel is very small, you can skip this step and carry out clustering analysis directly on gene expression. This can sometimes help with achieving better clustering results.
seurat <- RunPCA(seurat)
PC_ 1
Positive: IGFBP7, MS4A1, CXCR4, IL7R, BANK1, CLU, KLRB1, TIMP3, ETS1, CD79A
TRBC2, TRAC, MAF, CD3E, GPR183, CYBB, IRF8, TUBA1A, SPOCK2, CD2
FYB1, GIMAP7, CCR7, DPYSL3, TUBB, LTB, ANXA1, ARHGAP24, CD3D, CTSB
Negative: SLC26A3, CEACAM7, CEACAM5, CD24, MUC12, SLC26A2, CA1, CLCA4, CA4, MS4A12
CA2, AQP8, HHLA2, CEACAM1, CEACAM6, CDHR5, TMIGD1, CHP2, SELENBP1, CLCA1
GUCA2B, SDCBP2, DMBT1, CES2, C2orf88, CA7, FABP2, UGT2B17, NXPE4, COL17A1
PC_ 2
Positive: IGFBP7, TIMP3, CLCA1, TUBA1A, SLC12A2, CD24, THBS1, PPP1R1B, IMPDH2, ADH1C
CLU, TYMS, DPYSL3, CFTR, CPE, NXPE4, CES1, CDCA7, AQP1, C1QBP
RNASE1, UGT2B17, ALDH1B1, RGMB, SELENBP1, OLFM4, HMGB2, TUBB, SMOC2, RUNX1T1
Negative: MS4A1, CXCR4, IL7R, BANK1, CEACAM7, SLC26A3, KLRB1, CLCA4, TRBC2, CEACAM5
TRAC, CD79A, AQP8, CD3E, GPR183, CA4, CD2, SPOCK2, MS4A12, CCR7
ETS1, LTB, IRF8, PAX5, CD3D, FYB1, CD3G, CXCR5, FCRL1, CEACAM6
PC_ 3
Positive: CD24, CLCA1, PPP1R1B, SLC12A2, MS4A1, NXPE4, CXCR4, CFTR, SELENBP1, UGT2B17
BANK1, C1QBP, CA1, ADH1C, CHP2, IMPDH2, HMGB2, TYMS, CDCA7, ITLN1
UGT2A3, LEFTY1, MKI67, FERMT1, IL7R, LRMP, CD79A, CMBL, CA2, OLFM4
Negative: IGFBP7, TIMP3, CEACAM7, CLU, CLCA4, THBS1, SLC26A3, AQP8, CEACAM5, CA4
CPE, DPYSL3, ANXA1, MAF, MS4A12, LYVE1, RNASE1, GUCA2B, CES1, CEACAM6
RUNX1T1, CEACAM1, PLXND1, TMIGD1, FRZB, MS4A7, HHLA2, FZD7, CD163, CTSB
PC_ 4
Positive: MUC12, SLC12A2, MS4A1, CLU, TIMP3, CLCA4, CLCA1, CXCR4, CDCA7, BANK1
IGFBP7, THBS1, CEACAM7, AQP8, HMGB2, CEACAM5, SMOC2, AQP1, RGMB, CPE
OLFM4, DPYSL3, LGR5, ALDH1B1, TYMS, CEACAM6, L1TD1, IMPDH2, EPHB3, PPP1R1B
Negative: MS4A7, DNASE1L3, CTSB, C1QC, RNASE1, CA1, CD163, CYBB, C1QA, MAF
CD14, SLC26A2, SEC11C, CD24, CA2, TNFRSF17, C1QB, APOE, SELENBP1, FKBP11
LYVE1, CD79A, CHP2, PRDX4, ANXA1, SULT1B1, CES2, DERL3, PLXND1, UGT2B17
PC_ 5
Positive: CA1, IGFBP7, CD24, SLC26A2, MS4A1, CA2, TIMP3, SELENBP1, CLU, BANK1
CXCR4, CD79A, THBS1, CHP2, CES2, CPE, SULT1B1, UGT2B17, FCRL1, PAX5
NXPE4, ARHGAP24, ADH1C, DPYSL3, CMBL, CD79B, SMIM14, RUNX1T1, DEPP1, FZD7
Negative: SLC12A2, MS4A7, MUC12, CTSB, DNASE1L3, CLCA4, C1QC, RNASE1, CDCA7, CLCA1
TYMS, CD163, TUBA1A, HMGB2, C1QA, MAF, AQP8, OLFM4, MKI67, CEACAM5
LGR5, CD14, CYBB, IMPDH2, CEACAM7, RRM2, STMN1, RGMB, L1TD1, IL7R
As before, we can visualise how much variation is captured by each PC.
The ElbowPlot function helps to determine the number of significant PCs to use for downstream analyses. The plot typically shows the amount of variance explained by each PC, and the “elbow” point indicates a natural cutoff.
ElbowPlot(seurat, 50)
Plotting the top genes contributing to a specific principal component helps in understanding the biological factors driving the variation captured by that component. This type of plot highlights the genes with the highest loadings, which are the most influential in the principal component analysis.
PC_Plotting(seurat, dim_number = 1)
The FeaturePlot function in Seurat is used to visualize the expression of a specific gene across cells in a given dimensionality reduction space (e.g., PCA). This helps to understand how the expression of a gene varies across the principal components.
FeaturePlot(seurat, "CEACAM5", reduction = "pca") + scale_color_viridis_c()
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
We can also examine how various PCs are distributed spatially.
Here, we can see that high PC1 loadings enrich in follicular structures and low PC1 loadings enrich in crypt top cells.
ImageFeaturePlot(seurat, "PC_1") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
We can plot the expression of high (or low) loading genes to visualise how this correlates with our dimensionality reduction.
ImageFeaturePlot(seurat, "MS4A1", size=.5) + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Next, we will use the reduced dimensionality data for clustering and cluster visualisation.
RunUMAP: Perform Uniform Manifold Approximation and Projection (UMAP) to reduce the dimensionality of the data for visualization. The UMAP plot reduces the high-dimensional data to two dimensions, preserving the local and global structure of the data for visualization. Cells that are close together in the UMAP plot are similar in their gene expression profiles. seurat: The Seurat object. dims = 1:20: Specifies the principal components to use for UMAP.
FindNeighbors: Finding nearest neighbors helps to identify cells that are similar based on their PCA scores, which is used for clustering. seurat: The Seurat object. reduction = “pca”: Specifies that the PCA space should be used for finding neighbors. dims = 1:20: Specifies the principal components to use for identifying neighbors.
FindClusters: Clustering identifies distinct groups of cells with similar gene expression patterns. The resolution parameter controls the granularity of the clustering. seurat: The Seurat object. resolution = 0.7: Sets the resolution parameter for clustering. Higher values lead to more clusters, while lower values lead to fewer clusters.
seurat <- RunUMAP(seurat, dims = 1:20)
16:32:58 UMAP embedding parameters a = 0.9922 b = 1.112
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:32:58 Read 29336 rows and found 20 numeric columns
16:32:58 Using Annoy for neighbor search, n_neighbors = 30
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:32:58 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:33:03 Writing NN index file to temp file /project/.tmpRsessions/Rtmpr2ro8s/file9592d609faa65
16:33:03 Searching Annoy index using 1 thread, search_k = 3000
16:33:20 Annoy recall = 100%
16:33:21 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:33:24 Initializing from normalized Laplacian + noise (using RSpectra)
16:33:25 Commencing optimization for 200 epochs, with 1313314 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:33:43 Optimization finished
seurat <- FindNeighbors(seurat, reduction = "pca", dims = 1:20)
Computing nearest neighbor graph
Computing SNN
seurat <- FindClusters(seurat, resolution = 0.7)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 29336
Number of edges: 992160
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8802
Number of communities: 18
Elapsed time: 8 seconds
Next lets visualise the clusters - firstly, based on transcriptome embedding.
DimPlot: Creates a scatter plot of cells in a reduced-dimensional space, by default now using UMAP dimensionality reduction. seurat: The Seurat object containing the dimensionality reduction results and cluster assignments. label = TRUE: Adds cluster labels to the plot. repel = TRUE: Repels the labels to avoid overlapping, making the plot clearer.
DimPlot(seurat, label=T, repel=T)
And now lets plot the clusters in tissue space.
We can see that our clusters have quite nice correspondence to distinct spatial regions.
ImageDimPlot(seurat, size=.5)
Warning: No FOV associated with assay 'SCT', using global default FOV
As before, now we can use Seurat differential expression functions to identify marker genes for specific cell clusters.
FindMarkers: Identifies genes that are differentially expressed in a specified cluster compared to all other cells. seurat: The Seurat object containing the gene expression data and cluster identities. ident.1 = “0”: Specifies the cluster of interest for which marker genes are to be identified. In this case, cluster “0”. max.cells.per.ident = 500: Limits the number of cells to be used from each cluster for the differential expression analysis to 500. This can help to speed up the computation.
markers <- FindMarkers(seurat, ident.1="0", max.cells.per.ident=500)
head(markers)
We can visualise expression of cluster specific markers using feature plots
FeaturePlot(seurat, "CD3E", label=T, repel=T)+ scale_color_viridis_c(direction=-1)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
FeaturePlot(seurat, "MS4A1", label=T, repel=T)+ scale_color_viridis_c(direction=-1)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
FeaturePlot(seurat, "CEACAM5", label=T, repel=T)+ scale_color_viridis_c(direction=-1)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
FeaturePlot(seurat, "KIT", label=T, repel=T)+ scale_color_viridis_c(direction=-1)
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Or, as in our sequencing ST tutorial, detect and visualise top markers for every cluster.
markers <- FindAllMarkers(seurat, max.cells.per.ident = 500)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
Calculating cluster 12
Calculating cluster 13
Calculating cluster 14
Calculating cluster 15
Calculating cluster 16
Calculating cluster 17
head(markers)
scCustomize package provides a convenient helper function, Extract_Top_Markers, to extract the top marker genes for each cluster from the output of FindAllMarkers. This function simplifies the process of identifying and retrieving the most significant marker genes for analysis and visualisation.
In this case, we are extracting the top five markers per cluster.
top <- Extract_Top_Markers(markers, num_genes = 5, named_vector = FALSE, make_unique = TRUE)
top
[1] "COL19A1" "SMOC2" "MEIS2" "DPYSL3" "RUNX1T1" "MS4A1" "FCRL1" "PAX5" "BANK1" "TCL1A" "MKI67" "RRM2" "CDCA7"
[14] "LGR5" "LEFTY1" "CTLA4" "CD40LG" "ICOS" "IL7R" "ITK" "TNFRSF17" "DERL3" "FKBP11" "SEC11C" "CD79A" "CLU"
[27] "TIMP3" "DEPP1" "CPE" "AGTR1" "AQP8" "DUOX2" "CLCA4" "TFF1" "CEACAM7" "ANO7" "RETNLB" "ITLN1" "BEST2"
[40] "RAP1GAP" "CA1" "SELENBP1" "CA2" "NXPE4" "UGT2B17" "DNASE1L3" "C1QC" "MS4A7" "C1QB" "C1QA" "SLC26A3" "SLC26A2"
[53] "CD177" "MS4A12" "OTOP2" "BEST4" "CA7" "GUCA2B" "MSLN" "CD8A" "GZMA" "CCL5" "NKG7" "GNLY" "CHGA"
[66] "FEV" "INSM1" "CHGB" "GCG" "LYVE1" "CD163" "RNASE1" "PROX1" "FCN1" "MS4A2" "SLC18A2" "IL1RL1" "CPA3"
[79] "CTSG" "SH2D6" "TRPM5" "SH2D7" "AZGP1" "HTR3E" "CALB2" "PRPH" "SCG2" "SCGN" "RAB3B"
Clustered_DotPlot function from the scCustomize package provides a convenient and visually appealing way to display expression patterns of top marker genes across clusters using a dot plot. This function not only plots the expression data but also clusters the genes and groups for enhanced visual interpretation. This is an alternative to Seurat DotPlot function.
k = 18: Determines the number of clusters for the hierarchical clustering of genes to enhance visual separation of expression patterns.
We can see that most clusters have unique markers, which suggests the dataset is not over-clustered.
Clustered_DotPlot(seurat, features = top, k=18)
[[1]]
[[2]]
Additional Spatial Visualisations
The resolution of in situ datasets is typically very high and so it can be difficult to visualise everything in one plot. Below, we will explore different visualisations that can help unpick and understand the data a bit better.
To better visualise spatial distribution of clusters, sometimes it can be useful to subset only certain groups to reduce crowding. Here, we specifically only visualising two selected clusters.
WhichCells: Identifies cells based on specified criteria. seurat: The Seurat object. expression = seurat_clusters %in% c(0, 5): Logical expression to select cells belonging to clusters 0 and 5.
This works with ImageFeaturePlot too. Try it with some genes!
ImageDimPlot(seurat, cells=WhichCells(seurat, expression = seurat_clusters %in% c(0, 5)))
Warning: No FOV associated with assay 'SCT', using global default FOV
Sometimes, it can be useful to create additional fields of view of the data - for example, zooms of specific regions. First, let’s look at the coordinate system by plotting the data and turning on the plotting of the axes, which are off by default to create nicer looking plots.
This gives us a rough idea on where in the coordinate system to create any subsets or zooms of the data.
For example, if we want to zoom in on the follicle in the top right corner, we can see that it lies roughly between 4000-5000 and 8000-9000 coordinate regions.
ImageDimPlot(seurat, axes = T)
Warning: No FOV associated with assay 'SCT', using global default FOV
So, let’s create a new FOV with these coordinates. For this, we can use the Crop function.
seurat[[“COLON”]]: The spatial assay to be cropped. x = c(4200, 5000): The x-axis range for the crop. y = c(8000, 8800): The y-axis range for the crop. coords = “plot”: Specifies the coordinate system to use (typically “plot” for spatial coordinates).
seurat[[“ROI1”]] <- cropped: Adds the cropped region as a new FOV named “ROI1” in the Seurat object. This could be a more informative name, but avoid using underscores!
cropped <- Crop(seurat[["COLON"]], x = c(4200, 5000), y = c(8000, 8800), coords = "plot")
seurat[["ROI1"]] <- cropped
Warning: Key ‘XENIUM_’ taken, using ‘roi1_’ instead
Now we can limit our visualisations just to this region by specifying the name of the new FOV as an “fov” arguement.
As we are zooming in closer to the tissue, we can also switch from plotting cell centroids (i.e. dots) by default to visualising cell segmentation boundaries. Plotting cell boundary polygons for large FOVs can be quite time consuming, and doesn’t provide much more detail on a fully zoomed-out view.
ImageDimPlot(seurat, fov="ROI1", boundaries="segmentation", border.color = "black" )
We can visualise gene expression or other continous variable on the new FOV as before.
For example, here we have MS4A1/CD20 expression, which is a B-Cell marker. We can see it quite nicely limited to the lymphoid follicle.
ImageFeaturePlot(seurat, "MS4A1", fov="ROI1", boundaries="segmentation" , border.color = "black") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
We can also overlay the coordinates of individual molecules to the plot. For example, here we are added some more T-cell and B-cell specific markers.
This visualisation can be useful because molecules are stored independently of cells and cell boundaries in Seurat. Therefore, if there are regions where cell segmentation is not good, or if cells were filtered out from clustering analysis due to their low quality, the molecules will remain and can still be visualised this way.
For example, here we can see there are a few molecules of CXCR5 detected outside of cellular boundaries.
ImageFeaturePlot(seurat, "MS4A1", fov="ROI1", boundaries="segmentation", molecules=c("CXCR5", "FOXP3"), mols.size = .5, border.color = "black" ) + scale_fill_viridis_c()
Warning: minimal value for n is 3, returning requested palette with 3 different levels
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Cell Type Identification
You can manually annotate your cell clusters, or you can classify them using a reference single-cell dataset. This process is simpler than for Visium data because our data is at the single-cell level, establishing a one-to-one relationship without the need for spot deconvolution.
However, our transcriptome is more limited here, and some cell types may not be well represented. Additionally, our single-cell reference might be missing some cell types that are not well captured by droplet-based technologies but are present in our tissue data.
In this example, we will use a single-cell reference dataset that we prepared earlier.
We will start by reading in the seurat RDS file.
ref <- readRDS("/project/shared/spatial_data_camp/datasets/SINGLE_CELL_REFERENCES/COLON_HC_5K_CELLS.RDS")
Examine the object:
ref
An object of class Seurat
33556 features across 5725 samples within 3 assays
Active assay: RNA (33538 features, 2000 variable features)
3 layers present: counts, data, scale.data
2 other assays present: HTO, ADT
2 dimensional reductions calculated: pca, umap
And plot the pre-computed cell clusters. We can see that here we have quite high level annotation.
DimPlot(ref)
We want to evaluate how much structural information is lost in single-cell data when limiting ourselves to the targeted gene set. Accurate cluster prediction is challenging if the current gene set does not adequately identify them. To do this, we will quickly re-embedd the data using only the genes present in our spatial transcriptomics data and keep the original cluster annotations derived from unbiased data.
In this example, we can observe that the limited gene set does a reasonably good job at distinguishing major cell populations. However, it struggles to differentiate between similar cell types, such as myofibroblasts and fibroblasts, as effectively as before.
ref <- SCTransform(ref, residual.features =rownames(seurat))
Running SCTransform on assay: RNA
Computing residuals for the 325 specified features
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Calculating cell attributes from input UMI matrix: log_umi
Variance stabilizing transformation of count matrix of size 19008 by 5725
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 5000 cells
Found 80 outliers - those will be ignored in fitting/regularization step
Skip calculation of full residual matrix
Calculating gene attributes
Wall clock passed: Time difference of 15.09929 secs
Determine variable features
Setting min_variance based on median UMI: 0.16
Calculating residuals of type pearson for 319 genes
|
| | 0%
|
|======================================================================== | 50%
|
|===============================================================================================================================================| 100%
Centering data matrix
|
| | 0%
|
|===============================================================================================================================================| 100%
Place corrected count matrix in counts slot
Set default assay to SCT
ref <- RunPCA(ref)
PC_ 1
Positive: IGFBP7, SOCS3, APOE, IFITM1, TUBA1A, TIMP3, ANXA1, SELENOM, FRZB, VCAN
DEPP1, TNFAIP3, THBS1, CXCR4, CCL5, IRF8, RARRES2, CLU, GIMAP7, CD83
CD3E, TUBB, GPR183, CCL4, IL7R, PTGER4, ETS1, CD7, RORA, DPYSL3
Negative: GUCA2A, SLC26A3, CEACAM7, CA1, CA2, SLC26A2, CEACAM1, AQP8, CEACAM5, CDHR5
MS4A12, CA4, SDCBP2, GUCA2B, SELENBP1, MYH14, CD24, MUC12, CD177, CLCA4
TMIGD1, HRCT1, CEACAM6, AREG, CES2, C2orf88, GNA11, IL32, SLC6A8, CDKN2B
PC_ 2
Positive: IGFBP7, SOCS3, APOE, GUCA2A, IFITM1, TIMP3, CEACAM7, SLC26A3, TUBA1A, CA4
AQP8, CEACAM1, SELENOM, MS4A12, GUCA2B, VCAN, THBS1, DEPP1, CEACAM5, SLC26A2
IL32, FRZB, CDHR5, TMIGD1, CLCA4, CD177, SDCBP2, HRCT1, CA1, ANXA1
Negative: CXCL3, ADH1C, WFDC2, PPP1R1B, LEFTY1, CCL5, DERL3, CD24, STMN1, UBE2C
CD79A, C1QBP, PTTG1, KRTCAP3, CCL20, AREG, TYMS, SLPI, SEC11C, OLFM4
CD3E, LGALS2, HMGB2, CDCA7, RHOV, CXCR4, SOX9, GPR183, TKT, LCN2
PC_ 3
Positive: CXCL3, ADH1C, APOE, WFDC2, CD24, SOCS3, IGFBP7, RARRES2, PPP1R1B, LEFTY1
SLPI, SELENBP1, KRTCAP3, UBE2C, C1QBP, STMN1, LGALS2, UGT2B17, CA2, LCN2
AREG, THBS1, OLFM4, TIMP3, TYMS, NXPE4, CES2, SOX9, CMBL, RHOV
Negative: CCL5, CD3E, ANXA1, CXCR4, IL7R, IL32, TNFAIP3, CD7, GPR183, KLRB1
CST7, LTB, CD3D, FYB1, CD2, NKG7, LEPROTL1, CCL4, SPOCK2, TRBC2
TRBC1, CD8A, CD83, PTGER4, XCL2, CCR7, RORA, CEACAM7, SELENOK, AQP8
PC_ 4
Positive: DERL3, CD79A, TNFRSF17, FKBP11, CCL4, SEC11C, CD83, CA4, AQP8, MS4A1
VPREB3, CD79B, GUCA2B, CEACAM7, PRDX4, GUCA2A, CEACAM1, C1QA, C1QC, C1QB
DNASE1L3, FCRLA, SELENOM, BANK1, MS4A7, CLCA4, FCER2, LRMP, IL1B, TCL1A
Negative: CXCL3, ANXA1, CD3E, ADH1C, CD24, IL7R, WFDC2, CA1, IL32, CA2
APOE, CD7, SELENBP1, KLRB1, AREG, PPP1R1B, CCL20, TNFAIP3, RARRES2, SLPI
SOCS3, CCL5, LEPROTL1, THBS1, LEFTY1, IFITM1, LGALS2, CD3D, UBE2C, STMN1
PC_ 5
Positive: APOE, CA1, CA2, SLC26A2, SELENBP1, THBS1, SLC26A3, CCL4, VCAN, CD24
DERL3, TNFAIP3, IRF8, MUC12, SOCS3, GPR183, CCL5, CD79A, CDHR5, CES2
LGALS2, C1QA, TNFRSF17, C1QC, CD14, C1QB, AREG, UGT2B17, ADH1C, FKBP11
Negative: CA4, IGFBP7, CLU, RNASE1, GIMAP7, TUBA1A, CA7, BEST4, STMN1, SPIB
GUCA2B, WFDC2, OTOP2, HEPACAM2, GUCA2A, AQP1, AQP8, ANXA1, TIMP3, IL32
CEACAM6, UBE2C, IFITM1, CEACAM1, KLK1, HES6, RETNLB, TUBB, CLCA4, CTSE
ref <- RunUMAP(ref, dims=1:20)
16:34:59 UMAP embedding parameters a = 0.9922 b = 1.112
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:34:59 Read 5725 rows and found 20 numeric columns
16:34:59 Using Annoy for neighbor search, n_neighbors = 30
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:34:59 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:34:59 Writing NN index file to temp file /project/.tmpRsessions/Rtmpr2ro8s/file9592d43a25bdc
16:34:59 Searching Annoy index using 1 thread, search_k = 3000
16:35:02 Annoy recall = 100%
16:35:03 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:35:06 Initializing from normalized Laplacian + noise (using RSpectra)
16:35:06 Commencing optimization for 500 epochs, with 231718 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:35:15 Optimization finished
DimPlot(ref, label=T, repel=T)
If we visualise the specificity of the gene panel across our single cell reference clusters, we can see that the panel coverage is mainly concentrated across epithelial cells and T-Cells and other immune cells, with few specific markers expressed by stromal cells.
ps <- AggregateExpression(ref, features = rownames(seurat), normalization.method = "LogNormalize", assays="RNA", return.seurat = T)
|
| | 0%
|
|============= | 9%
|
|========================== | 18%
|
|======================================= | 27%
|
|==================================================== | 36%
|
|================================================================= | 45%
|
|============================================================================== | 55%
|
|=========================================================================================== | 64%
|
|======================================================================================================== | 73%
|
|===================================================================================================================== | 82%
|
|================================================================================================================================== | 91%
|
|===============================================================================================================================================| 100%
ps <- ScaleData(ps, features=rownames(ps))
Centering and scaling data matrix
|
| | 0%
|
|===============================================================================================================================================| 100%
pheatmap(LayerData(ps, layer="scale.data"), show_rownames = F)
Next, we can use the standard Seurat integration and cross-classification workflow to transfer single-cell derived labels to our spatial object.
Briefly, the first function identifies anchors between the reference single-cell dataset (ref) and the query spatial dataset (seurat). Anchors are pairs of cells that are considered similar between the datasets. The normalization.method = “SCT” specifies that SCTransform normalization should be used.
The second step transfers the cell type labels from the reference dataset to the query dataset. The anchorset argument specifies the anchors found in the previous step. The refdata = ref$CellType argument specifies the cell type labels from the reference dataset to be transferred. The prediction.assay = TRUE argument indicates that the transferred labels should be stored in a new assay in the query dataset. The weight.reduction = seurat[[“pca”]] argument specifies the dimensionality reduction to be used for weighting the transfer, and dims = 1:30 specifies the number of dimensions to use.
anchors <- FindTransferAnchors(reference = ref,
query = seurat,
normalization.method = "SCT")
Normalizing query using reference SCT model
Warning: No layers found matching search pattern providedPerforming PCA on the provided reference using 319 features as input.
Projecting cell embeddings
Finding neighborhoods
Finding anchors
Found 5193 anchors
seurat <- TransferData(anchorset = anchors,
refdata = ref$CellType,
prediction.assay = TRUE,
weight.reduction = seurat[["pca"]],
query = seurat,
dims=1:30)
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Predicting cell labels
Warning: Layer counts isn't present in the assay object; returning NULL
Unfortunately, the predicted labels and spatial clusters do not correspond clearly in all cases. This discrepancy is particularly evident in the middle regions of the UMAP, where many cells are predicted as epithelial cells - probably incorrectly!
How to improve this?
Ensure Good Representation of Cell Type Markers in in situ Target Panel Most critically, before undertaking any experiments you want to ensure that there is good representation of all cell types in your target panel - in this case, there is not much to be done as the data has already been generated.
Review and Refine Reference Data: Ensure that the reference single-cell dataset is comprehensive and accurately annotated. If certain cell types are not well represented or annotated in the reference dataset, it can lead to misclassification.
Increase the Number of Dimensions: Increasing the number of dimensions used in the UMAP and PCA steps might capture more variance in the data, leading to better label transfer.
Filter and Preprocess Data: Filtering out low-quality cells or genes and performing additional preprocessing steps can enhance the accuracy of the transfer anchors and, consequently, the label predictions.
Manually Annotate or Correct Predictions: In cases where automatic label transfer is insufficient, consider manually annotating or correcting the predictions for critical regions to ensure accuracy.
DimPlot(seurat, group.by = "predicted.id")
As before, we can also visualise the predicted cell labels in tissue space.
ImageDimPlot(seurat, group.by = "predicted.id")
Warning: No FOV associated with assay 'SCT', using global default FOV
In line with non-specific predictions, we can also see that the prediction score across these areas is lower.
Outside of stromal cells, we can also see that prediction probability can be low in cells that embedd “between” clusters, for example between core T-Cells and B-Cells, two populations that should be distinct.
This is often the case where cell segmentation is imperfect and partitions transcripts in such a way that it generates “artificial” doublets by pulling in transcripts from an adjacent cell.
FeaturePlot(seurat, "predicted.id.score")
For example, if we visualise the lineage markers for T-Cells and B-Cells, we can see that they are often “co-expressed” in the same cells when biologically, they should not be.
The FeatureScatter function in Seurat is used to create a scatter plot showing the relationship between the expression levels of two genes across all cells. This visualization helps to identify potential correlations or patterns between the two genes.
FeatureScatter(seurat, "MS4A1", "CD3D", jitter=T)
FeaturePlot(seurat, c("MS4A1", "CD3D"))
To improve these artefacts, we can try alternative cell segmentation algorithms. What works best is very tissue dependant and there’s no easy one stop solution to this. Cell segmentation algorithms can be divided into a few groups.
Nuclei-based Segmentation algorithms primarily focus on identifying cell nuclei, which are usually more distinct and easier to detect than the cell boundaries. Once the nuclei are identified, the cell boundaries are inferred by expanding around the nuclei. This approach works well in tissues where the nuclei are clearly visible and distinct and in early versions of many in situ platforms, were the only available methods due to only using DAPI stain.
Cell Boundary-Based Segmentation algorithms (e.g. Cellpose) directly segments cells by identifying their boundaries. It is particularly effective for images with complex cell shapes and varying sizes, but this required good cell boundary staining - this is not available for our test dataset. Often cell boundary staining can be non-uniform across different tissues, adding further difficulties. Cellpose version 3 incorporates user-guided model training, which can be very useful for difficult to segment cell types - but this requires time investment to annotate training examples.
Transcript-Density Based Segmentation algorithms, like Baysor segments cells based on the spatial distribution of transcripts. It uses Bayesian inference to assign transcripts to cells, considering both the density and distribution of RNA molecules. This can be very useful for improving cell segmentation where cell boundary stain is not available or not working well.
In this case, we will try re-segmenting our data with Baysor. Here’s the run we prepared earlier - see supplementary material on how to process the data yourself.
baysor <- "/project/shared/spatial_data_camp/datasets/PRECOMPUTED/baysor"
The key output of baysor is the file with transcripts, which have been re-assigned to a new cell identifier.
seg <- read_csv(file.path(baysor, "segmentation.csv"))
Rows: 3463898 Columns: 19── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): gene, fov_name, cell, ncv_color
dbl (14): transcript_id, cell_id, overlaps_nucleus, x, y, z, qv, nucleus_distance, cell_id2, molecule_id, prior_segmentation, confidence, cluster, as...
lgl (1): is_noise
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(seg)
There will be some transcripts that cannot be assigned to a cell - about 10% in this case. This information is stored under “is_noise” flag. This is fairly normal levels of noise.
table(seg$is_noise)
FALSE TRUE
3256065 207833
baysor also calculates two confidence values - transcript assignment confidence represents the confidence that the transcript has been assigned to the correct cell.
qplot(seg$assignment_confidence)
table(seg$assignment_confidence > .9)
FALSE TRUE
1442325 2021573
And transcript confidence - the confidence that the molecule itself is real and not noise.
qplot(seg$confidence)
table(seg$confidence > .9)
FALSE TRUE
262472 3201426
We can filter out low confidence and low assignment confidence transcripts here from further analysis. How stringent you want to be depends on whether you want to keep as much data as possible and accept some inaccuracies, or end up with the cleanest possible dataset.
Here, we will filter out transcripts that have not been assigned to cells, and below 0.9 confidence and assignment confidence.
Then, we tabulate a cell by gene matrix from these data.
filtered <- seg[seg$confidence > .9 & seg$assignment_confidence > .9 & !seg$is_noise, ]
mat <- table(filtered$gene, filtered$cell)
mat <- matrix(mat, ncol = ncol(mat), dimnames = dimnames(mat))
Baysor further provides diagnostic info about cells in “segmentation_cell_stats.csv” file, which we will also read in here. The following parameters can be used to filter low-quality cells:
area: area of the convex hull around the cell molecules avg_confidence: average confidence of the cell molecules density: the number of molecules in a cell divided by the cell area elongation: ratio of the two eigenvalues of the cell covariance matrix n_transcripts: number of molecules per cell avg_assignment_confidence: average assignment confidence per cell. Cells with low avg_assignment_confidence have a much higher chance of being an artifact. max_cluster_frac (only if n-clusters > 1): fraction of the molecules coming from the most popular cluster. Cells with low max_cluster_frac are often doublets. lifespan: number of iterations the given component exists. The maximal lifespan is clipped proportionally to the total number of iterations. Components with a short lifespan likely correspond to noise.
stats <- read_csv(file.path(baysor, "segmentation_cell_stats.csv"))
Rows: 36875 Columns: 13── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): cell
dbl (12): x, y, z, cluster, n_transcripts, density, elongation, area, avg_confidence, avg_assignment_confidence, max_cluster_frac, lifespan
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stats <- as.data.frame(stats)
rownames(stats) <- stats$cell
head(stats)
Now, we can assemble a seurat object as before - we first construct a basic object with cell by gene matrix and cell meta data.
seurat_reseg <- CreateSeuratObject(counts = mat, assay = "XENIUM", meta.data = as.data.frame(stats))
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Data is of class matrix. Coercing to dgCMatrix.
Different segmentation algorithms output cell boundaries in various cell formats.
GeoJSON is a format for encoding various geographic data structures using JavaScript Object Notation (JSON). It is widely used for representing spatial features and their attributes and is used by some algorithms to store and output cell segmentation boundaries.
In R, we can read in polygon data from a GeoJSON file using the FROM_GeoJson function.
NOTE: baysor is a 3D cell segmentation algorithm. This means it considered z-stack information. Some algorithms only perform cell segmentation on a representative layer - e.g. Merscope Cellpose takes the middle z-stack. This resegmented data represents 3D segmentations of cells. Since these are 3D segmentations, they may look unusual when visualized as 2D projections.
polygons <- FROM_GeoJson(file.path(baysor, "segmentation_polygons.json"))
In the below code, we extract the polygon coordinates from the data and reformat them into a data frame that Seurat requires to construct a Segmentation object.
polygons <- lapply(1:length(polygons$geometries), FUN=function(x){
df <- as.data.frame(polygons$geometries[[x]]$coordinates)
df$cell_id <- paste0("CRef9694c57-", x)
df
})
polygons <- do.call(rbind, polygons)
colnames(polygons) <- c("x", "y", "cell_id")
polygons <- polygons[polygons$cell_id %in% Cells(seurat_reseg), ]
polygons <- CreateSegmentation(polygons)
Warning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygonWarning: less than 4 coordinates in polygon
Then, as before, we add both the cell centroid and cell boundaries as segmentations to the seurat object. We skip adding individual molecule coordinates for now.
cents <- CreateCentroids(stats[Cells(seurat_reseg), c("x", "y")])
cents@cells <- Cells(seurat_reseg)
coords <- CreateFOV(coords =list(centroids = cents, segmentation=polygons) ,
type = c("centroids", "segmentation"),
molecules = NULL,
assay = "XENIUM")
seurat_reseg[["COLON"]] <- coords
From here, we can use the seurat object to visualise various cell meta data - for example, average transcript assignment confidence per cell.
ImageFeaturePlot(seurat_reseg, "avg_assignment_confidence" ) + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Lets filter out low count cells and re-cluster the data as before
seurat_reseg$FILT <- seurat_reseg$nCount_XENIUM >= 15
seurat_reseg <- subset(seurat_reseg, FILT)
Warning: Not validating Centroids objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating Seurat objects
seurat_reseg <- SCTransform(seurat_reseg, assay = "XENIUM", clip.range = c(-10, 10))
Running SCTransform on assay: XENIUM
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Variance stabilizing transformation of count matrix of size 538 by 20317
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 352 genes, 5000 cells
Found 16 outliers - those will be ignored in fitting/regularization step
Second step: Get residuals using fitted parameters for 538 genes
Computing corrected count matrix for 538 genes
Calculating gene attributes
Wall clock passed: Time difference of 5.920289 secs
Determine variable features
Centering data matrix
|
| | 0%
|
|===============================================================================================================================================| 100%
Getting residuals for block 1(of 5) for counts dataset
Getting residuals for block 2(of 5) for counts dataset
Getting residuals for block 3(of 5) for counts dataset
Getting residuals for block 4(of 5) for counts dataset
Getting residuals for block 5(of 5) for counts dataset
Centering data matrix
|
| | 0%
|
|===============================================================================================================================================| 100%
Finished calculating residuals for counts
Set default assay to SCT
seurat_reseg <- RunPCA(seurat_reseg)
PC_ 1
Positive: SLC26A3, CEACAM5, CEACAM7, CD24, MUC12, SLC26A2, CA1, CLCA4, CA2, MS4A12
CA4, AQP8, CEACAM1, HHLA2, SELENBP1, CDHR5, CEACAM6, CHP2, TMIGD1, DMBT1
CES2, GUCA2B, SDCBP2, CLCA1, C2orf88, SULT1B1, UGT2B17, FABP2, CA7, NXPE4
Negative: IGFBP7, MS4A1, TIMP3, IL7R, CXCR4, CLU, MAF, ETS1, KLRB1, TRAC
CD79A, CYBB, BANK1, TRBC2, CD3E, ANXA1, TUBA1A, CTSB, MS4A7, CD2
SPOCK2, GPR183, FYB1, RNASE1, IRF8, DPYSL3, GIMAP7, TUBB, PLXND1, THBS1
PC_ 2
Positive: MS4A1, IL7R, CXCR4, KLRB1, TRAC, TRBC2, BANK1, CD3E, CD2, SLC26A3
SPOCK2, CEACAM7, ETS1, CLCA4, CD79A, GPR183, CD3G, CD3D, FYB1, CEACAM5
CD6, LTB, AQP8, IRF8, CCR7, FCRL1, MS4A12, CA4, PAX5, CEACAM1
Negative: IGFBP7, TIMP3, THBS1, CLU, TUBA1A, CLCA1, CD24, SLC12A2, DPYSL3, PPP1R1B
CPE, CES1, ADH1C, IMPDH2, AQP1, TYMS, RUNX1T1, SELENBP1, CFTR, NXPE4
RGMB, CDCA7, C1QBP, ANXA1, TUBB, FRZB, UGT2B17, PLXND1, SMOC2, ALDH1B1
PC_ 3
Positive: CD24, CLCA1, PPP1R1B, SLC12A2, NXPE4, SELENBP1, CFTR, C1QBP, MS4A1, UGT2B17
IMPDH2, HMGB2, TYMS, ADH1C, CDCA7, CXCR4, BANK1, MKI67, UGT2A3, ANO7
CHP2, ITLN1, LEFTY1, FERMT1, RRM2, LRMP, CD79A, OLFM4, HEPACAM2, CMBL
Negative: IGFBP7, TIMP3, CLCA4, CEACAM7, SLC26A3, CLU, CEACAM5, THBS1, AQP8, CA4
MAF, MS4A12, CEACAM1, CEACAM6, GUCA2B, ANXA1, MS4A7, TMIGD1, CPE, DPYSL3
PLXND1, HHLA2, RNASE1, CES1, CTSB, RUNX1T1, FRZB, LYVE1, C1QC, CD163
PC_ 4
Positive: IGFBP7, TIMP3, CLU, IL7R, THBS1, TRAC, KLRB1, TRBC2, CD3E, CD2
ETS1, CPE, DPYSL3, CD3G, SPOCK2, CD3D, CD6, RORA, CES1, IFITM1
RUNX1T1, AQP1, ANK2, GIMAP7, DEPP1, ITK, CCL5, CD5, CD8A, SMOC2
Negative: MS4A7, CYBB, CTSB, RNASE1, C1QC, CD163, DNASE1L3, MAF, CD14, C1QA
LYVE1, C1QB, IRF8, APOE, PLXND1, SEC11C, CD79A, FCER2, TNFRSF17, TUBA1A
MS4A1, TNFSF13B, FCN1, CLEC9A, FKBP11, SERPINA1, STMN1, CD83, LRMP, PRDX4
PC_ 5
Positive: IL7R, KLRB1, CD3E, TRAC, CD2, TRBC2, MAF, CD3G, CD3D, FYB1
CD6, MS4A7, RNASE1, CCL5, GIMAP7, CTSB, CD8A, C1QC, CD163, ITK
CTLA4, RORA, TRBC1, SPOCK2, DNASE1L3, CLCA1, CD5, TRAT1, CD14, C1QA
Negative: MS4A1, BANK1, CD79A, CXCR4, FCRL1, IRF8, PAX5, ARHGAP24, CD79B, IGFBP7
SPIB, LRMP, SEC11C, TNFRSF17, CLU, TIMP3, SELL, CXCR5, FKBP11, SMIM14
CYBB, DERL3, PRDX4, THBS1, CEACAM7, CLCA4, CD83, LTB, CPE, DPYSL3
seurat_reseg <- RunUMAP(seurat_reseg, dims = 1:20)
16:37:51 UMAP embedding parameters a = 0.9922 b = 1.112
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:37:51 Read 20317 rows and found 20 numeric columns
16:37:51 Using Annoy for neighbor search, n_neighbors = 30
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
16:37:51 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:37:55 Writing NN index file to temp file /project/.tmpRsessions/Rtmpr2ro8s/file9592d3041f1e6
16:37:55 Searching Annoy index using 1 thread, search_k = 3000
16:38:06 Annoy recall = 100%
16:38:09 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:38:13 Initializing from normalized Laplacian + noise (using RSpectra)
16:38:13 Commencing optimization for 200 epochs, with 916448 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:38:27 Optimization finished
seurat_reseg <- FindNeighbors(seurat_reseg, reduction = "pca", dims = 1:20)
Computing nearest neighbor graph
Computing SNN
seurat_reseg <- FindClusters(seurat_reseg, resolution = 0.3)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 20317
Number of edges: 743617
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9394
Number of communities: 13
Elapsed time: 4 seconds
Visualising clusters, we can see that we already obtain a better separation in the UMAP embedding than before. Though of course, distances in the UMAP space can be very misleading and careful interpretation is required.
DimPlot(seurat_reseg, label=T, repel = T)
Next we visualise the clusters in tissue space.
ImageDimPlot(seurat_reseg)
Warning: No FOV associated with assay 'SCT', using global default FOV
As before, lets cross-classify our cells using the reference single cell dataset
anchors <- FindTransferAnchors(reference = ref, query = seurat_reseg, normalization.method = "SCT")
Normalizing query using reference SCT model
Warning: No layers found matching search pattern providedPerforming PCA on the provided reference using 319 features as input.
Projecting cell embeddings
Finding neighborhoods
Finding anchors
Found 5749 anchors
seurat_reseg <- TransferData(anchorset = anchors, refdata = ref$CellType, prediction.assay = TRUE,
weight.reduction = seurat_reseg[["pca"]], query = seurat_reseg, dims=1:30)
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Predicting cell labels
Warning: Layer counts isn't present in the assay object; returning NULL
Visualising the predictions, we’ve separated T-Cells from B-Cells much better. The stromal clusters still predict poorly, but that is due to poor probe coverage.
DimPlot(seurat_reseg, group.by = "predicted.id")
We can check the distribution in tissue space:
ImageDimPlot(seurat_reseg, group.by = "predicted.id")
Warning: No FOV associated with assay 'SCT', using global default FOV
FeaturePlot(seurat_reseg,"predicted.id.score")
Spatial Neighbourhood Analyis
Spatial neighbourhood analysis identifies cells that are spatially close to each other within a tissue section. This technique helps to understand the spatial organization and potential interactions between cells. The same principles used in Visium data can be applied to in situ data.
GetTissueCoordinates: Retrieves the spatial coordinates of the centroids from the Seurat object. which = “centroids”: Specifies that the centroids’ coordinates should be retrieved. rownames(coords) <- coords$cell: Sets the row names of the coords data frame to the cell IDs.
FindNeighbors: Identifies the nearest neighbours for each cell based on their spatial coordinates. as.matrix(coords[, c(“x”, “y”)]): Converts the x and y coordinates to a matrix format. k.param = 20: Specifies the number of nearest neighbours to identify for each cell. return.neighbor = TRUE: Ensures that the function returns the neighbour indices and distances.
TIP: This approach identifies spatial neighbours. If you analysis requires precise identification of directly adjacent or interacting cell neighbours, then a delaunay network based approach would be more appropriate. R package Giotto implements some nice functionalities based on this
coords <- GetTissueCoordinates(seurat, which = "centroids")
rownames(coords) <- coords$cell
neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 20, return.neighbor=TRUE)
Computing nearest neighbors
As in our Visium example, we can use it automatically select cells that are adjacent or physically close to some feature of interest. In this case, we want to automatically select any cells that are nearby the group of cells which have been designated as cluster 8, which are mid-crypt epithelial cells.
This can be useful to explore how cells in adjacent cells might influence or interact with each other.
WhichCells identifies the cells that belong to a specific cluster or meet a particular expression criterion
TopNeighbors finds the top n nearest neighbours for a given set of cells based on the k-NN graph. Increasing the n here will return more and more distal spots - tweak to your requirements!
How would you analyse these groups further? What are the adjacent cells? Are they heterogenous? What do they express?
cells <- WhichCells(seurat, expression= SCT_snn_res.0.7 == 8)
adjacent <- TopNeighbors(neighbours, cells, n = 10)
Idents(seurat) <- "Other Cells"
seurat <- SetIdent(seurat, cells = adjacent, "Adjacent Cells")
seurat <- SetIdent(seurat, cells = cells, "Cells of Interest")
ImageDimPlot(seurat)
Warning: No FOV associated with assay 'SCT', using global default FOV
seurat[["group1"]] <- Idents(seurat)
Finding Spatially Correlated Genes
We can use the spatial neighbourhood graph to identify spatially correlated features.
We start by calculating, for each cell, the pseudobulk expression of all cells in it’s local neighbourhood. First of all, lets expand the k.param to include more neighbours - this will control whether we’re including more distal gene expression.
FindNeighbors: Identifies the nearest neighbours for each cell based on their spatial coordinates. as.matrix(coords[, c(“x”, “y”)]): Converts the x and y coordinates to a matrix format. k.param = 50: Specifies the number of nearest neighbours to identify for each cell.
LayerData: Extracts the gene expression data from the specified layer and assay in the Seurat object. layer = “counts”: Specifies that the counts layer should be extracted. assay = “XENIUM”: Specifies the assay from which to extract the data.
as.matrix(neighbours$nn %*% t(mt)): Multiplies the neighbour matrix with the transpose of the gene expression matrix to aggregate the expression levels of neighbouring cells.
neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 50)
Computing nearest neighbor graph
Computing SNN
mt <- LayerData(seurat, layer = "counts", assay = "XENIUM")
sum_mtx <- as.matrix(neighbours$nn %*% t(mt))
We can store the neighbourhood-aggregated values in our Seurat object as a separate assay, which we will call “NEIGHBOURHOOD50”. We then normalise the matrix.
seurat[["NEIGHBOURHOOD50"]] <- CreateAssayObject(t(sum_mtx))
seurat <- NormalizeData(seurat, assay = "NEIGHBOURHOOD50")
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
We can then apply quick correlation calculations to identify spatially correlated features.
corrgenes <- cor(as.matrix(t(LayerData(seurat, assay = "NEIGHBOURHOOD50", layer = "data"))))
diag(corrgenes) <- 0
high_corr_genes <- which(rowMaxs(corrgenes) > .7)
diag(corrgenes) <- 1
heatmap <- pheatmap(corrgenes[high_corr_genes, high_corr_genes], border_color = NA)
Here, we use a very quick cutree function to partition the hierachical clustering into 5 groups of co-localising genes.
TIP- for more advanced module detection, you can use the spatial neighbourhood aggregated matrix with tools such as WGCNA (though it has its own meta-cell functionality).
modules <- cutree(heatmap$tree_row, 5)
modules
ADH1C AKR1C3 ALDH1B1 ANK2 ANXA1 ANXA13 AQP1 AQP8 ARHGAP24 BANK1 BCAS1 BEST4 C1QA C1QBP C1QC C2orf88 CA1
1 1 2 3 3 1 2 4 5 5 4 4 2 2 2 4 4
CA2 CA4 CA7 CCR7 CD177 CD2 CD24 CD3D CD3E CD3G CD40LG CD5 CD6 CD79A CD79B CD83 CDCA7
4 4 4 5 4 5 1 5 5 5 5 5 5 5 5 5 2
CDHR5 CDK6 CDKN2B CEACAM1 CEACAM5 CEACAM6 CEACAM7 CES1 CES2 CFTR CHP2 CLCA1 CLCA4 CLU CMBL COL17A1 CREB3L1
4 2 4 4 4 4 4 3 1 1 1 1 4 3 1 4 1
CTLA4 CXCR4 CXCR5 CYBB DMBT1 DPYSL3 EBPL EPHB3 ETS1 FABP2 FCRL1 FCRLA FERMT1 FKBP11 FYB1 FZD7 GALNT5
5 5 5 5 4 3 2 2 5 1 5 5 1 2 5 3 1
GIMAP7 GNA11 GPR183 GUCA2A GUCA2B HEPACAM2 HHLA2 HMGB2 HRCT1 ICOS IFITM1 IGFBP7 IL7R IMPDH2 IRF8 ITK ITLN1
5 1 5 4 4 1 4 3 4 5 5 3 5 2 5 5 1
KLRB1 KRTCAP3 L1TD1 LEF1 LEFTY1 LGR5 LRMP LTB LYVE1 MAF MKI67 MS4A1 MS4A12 MUC12 MYH14 NXPE4 OLFM4
5 1 2 5 2 2 3 5 3 5 2 5 4 1 1 1 2
OTOP2 PAX5 PCLAF PLCE1 PLXND1 PPP1R1B RETNLB RGMB RORA RRM2 RUNX1T1 SCNN1A SDCBP2 SEC11C SELENBP1 SELENOM SELL
4 5 2 1 5 1 2 2 5 2 3 1 4 2 1 3 5
SLC12A2 SLC26A2 SLC26A3 SLC6A8 SOX9 SPOCK2 STMN1 SULT1B1 TCL1A TIGIT TIMP3 TK1 TKT TMIGD1 TNFRSF17 TRAC TRAT1
2 4 4 4 1 5 2 1 5 5 3 2 2 4 2 5 5
TRBC1 TRBC2 TUBA1A TUBB TYMS UGT2A3 UGT2B17
5 5 2 3 2 1 1
Lets visualize some of the detected spatially co-localizing genes. For example, module 2 genes - we can see that CEACAM6 and AQP8 are spatially similar, but not necessarily always expressed by the same cells.
ImageFeaturePlot(seurat, "CEACAM6") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
ImageFeaturePlot(seurat, "AQP8") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Similarly, module 4 genes which are expressed by different cell types, but show strong co-localisation:
ImageFeaturePlot(seurat, "CD3G") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
ImageFeaturePlot(seurat, "CD79B") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
We can use Seurat’s module score functionality to add the modules to the meta data as an aggregate score for ease of visualisation.
The AddModuleScore function in Seurat is used to calculate and add scores for predefined gene modules to a Seurat object. These module scores represent the average expression levels of a set of genes (modules) for each cell, which can be used to identify specific biological processes or cell states.
seurat: The Seurat object to which the module scores are added. features = split(names(modules), modules): Specifies the gene modules. The split function is used to create a list of gene sets, where each set corresponds to a specific module. assay = “SCT”: Specifies the assay from which to calculate the module scores. Here, “SCT” refers to the SCTransform-normalized data. nbin = 3: Divides the genes into 3 bins based on their average expression levels to account for differences in gene expression distributions. Need to reduce this from defaults as our number of measured genes is much lower than in single cell analysis. name = “MOD”: Specifies the prefix for the module score names. The resulting scores will be named “MOD1”, “MOD2”, etc.
seurat <- AddModuleScore(seurat, features=split(names(modules), modules), assay = "SCT", nbin=3, name = "MOD" )
Visualising module scores - we can see that we have identified a group of genes co-localising at the base of the epithelial crypts (MOD1) and another module of genes co-localising in lymphoid follicles.
ImageFeaturePlot(seurat, "MOD1") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
ImageFeaturePlot(seurat, "MOD4") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Detecting Cellular Niches
We can use a similar approach to define cellular niches.
Lets repeat the process of defining spatial neighbourhood expression, with the following modifications:
Expand the nearest neighbours even further, to consider 100 nearby cells.
Exclude the transcriptomes of the cells themselves and only consider the expression of nearby cells.
This approach will enable us to group cells based not on their identity, but on their microenvironment.
neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 100)
Computing nearest neighbor graph
Computing SNN
diag(neighbours$nn) <- 0 # dont count transcriptome of the cell itself, just neighbours
mt <- LayerData(seurat, layer = "counts", assay = "XENIUM")
sum_mtx <- as.matrix(neighbours$nn %*% t(mt))
How is this useful? Well, now you can cluster cells not on their gene expression values, but gene expression values of surrounding cells. This effectively partitions cells not based on their identity, but on their micro-environment! Using this approach, you can identify tissue niches
Alternative approaches - you could count cell types rather than gene expression values, but that requires you to have finalised cell annotation for your dataset, which is not ideal. So, we do unbiased transcriptomics approach.
How would you run this with cell types?
seurat[["NEIGHBOURHOOD100"]] <- CreateAssayObject(t(sum_mtx))
Warning message:
In png(..., res = dpi, units = "in") :
unable to open connection to X11 display ''
DefaultAssay(seurat) <- "NEIGHBOURHOOD100"
seurat <- NormalizeData(seurat)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
seurat <- ScaleData(seurat, features = rownames(seurat))
Centering and scaling data matrix
|
| | 0%
|
|===============================================================================================================================================| 100%
seurat <- RunPCA(seurat, features = rownames(seurat))
PC_ 1
Positive: ETS1, ARHGAP24, GIMAP7, MS4A1, TRAC, CLU, RORA, IL7R, BANK1, FYB1
CCR7, CD83, CXCR4, CXCR5, LTB, SELL, PAX5, FCRL1, CTLA4, TNFSF13B
SIT1, SPOCK2, CD79B, LEPROTL1, TBC1D4, IRF8, GPR183, ODF2L, CYBB, DPYSL3
Negative: CDHR5, CA2, COL17A1, CEACAM5, CHP2, HHLA2, MYH14, SLC26A2, CES2, FABP2
SDCBP2, C2orf88, CEACAM6, SULT1B1, CA1, CA4, BCAS1, CA7, SCNN1A, CEACAM1
DMBT1, MUC12, MS4A12, SLC26A3, AKR1C3, GNA11, SELENBP1, CMBL, CDKN2B, TMIGD1
PC_ 2
Positive: SLC12A2, ALDH1B1, PPP1R1B, LEFTY1, RGMB, CDCA7, ANXA13, SOX9, WFDC2, HPGDS
IMPDH2, RETNLB, OLFM4, EPHB3, TYMS, TUBA1A, TKT, L1TD1, ITLN1, TK1
AZGP1, STXBP6, GALNT8, C1QBP, FERMT1, PCLAF, TRPM5, AQP1, BMX, ADH1C
Negative: SFXN1, DNASE1L3, TMIGD1, AQP8, BEST4, GUCA2B, GUCA2A, CEACAM7, TNFAIP3, IL32
MS4A12, CD177, CLCA4, OTOP2, SLC26A3, CD3G, CD3D, CD40LG, ADRA2A, PI3
GPR183, CD2, LEF1, CDKN2B, GZMA, CA4, SELL, TRBC2, CD6, CD5
PC_ 3
Positive: THBS1, FRZB, LYVE1, RNASE1, CD163, AGTR1, ANXA1, VCAN, CTSG, TIMP3
MS4A7, GATA2, CD14, MAOB, MEIS2, C1QB, FZD7, CMA1, CES1, CPE
C1QA, AFAP1L2, CPA3, IL1RL1, PLXND1, C1QC, KRT86, SLC18A2, MS4A2, IGFBP7
Negative: SPIB, SMIM14, CD24, RORC, CD79A, LRMP, KRTCAP3, MKI67, TRBC2, CXCR4
RRM2, LGALS2, RAB26, LEFTY1, GPRIN3, FYB1, ITLN1, CD6, TIGIT, KLK1
IRF8, KLRB1, PAX5, CD3E, BANK1, UBE2C, ANXA13, CD2, C1QBP, ITK
PC_ 4
Positive: DUOX2, IL17RB, TFF1, LGR5, RAB3B, HES6, ASCL2, CHGB, SMOC2, SMIM6
CADPS, L1TD1, SFXN1, RFX6, VWA5B2, INSM1, AREG, PROX1, SCGN, COL19A1
EPHB3, SCG5, CCL20, RETNLB, RGS13, SCG3, FEV, CRYBA2, CHGA, TRPM5
Negative: TNFRSF17, SEC11C, FKBP11, DERL3, CD79A, PRDX4, ADH1C, LGALS2, SELENOK, ID2
PTGER4, AKR7A3, CA1, SLPI, APOE, DNASE1L3, CD24, CKAP4, SMIM14, SLC26A2
CES2, CD14, C1QC, AKR1C3, CA2, VCAN, SELENBP1, TKT, SULT1B1, IGFBP7
PC_ 5
Positive: DNASE1L3, C1QC, C1QA, ASCL2, C1QB, CTSB, MS4A7, MUC12, CEACAM5, CD163
CLEC9A, SERPINA1, LGR5, SCNN1A, CEACAM1, CLCA4, CD14, SLC29A4, HES6, RAB3B
KLRB1, CYBB, AQP8, CEACAM6, ANXA1, CRYBA2, CCL20, SMIM6, PI3, NKX2-2
Negative: SLPI, PDE4C, CMA1, SMIM14, GATA2, AGTR1, DEPP1, ADH1C, EBPL, PDZK1IP1
GPRC5C, LYVE1, FCN1, IL1RL1, SPDEF, ADRA2A, TBC1D4, ROBO2, IFITM1, CTSG
MS4A2, AKR7A3, ABCC8, CPE, CKAP4, CLU, HDC, PBK, NOSIP, TKT
seurat <- FindNeighbors(seurat, reduction = "pca", dims = 1:10)
Computing nearest neighbor graph
Computing SNN
seurat <- FindClusters(seurat, resolution = 0.1, cluster.name = "Niches")
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 29336
Number of edges: 716602
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9670
Number of communities: 11
Elapsed time: 2 seconds
Lets visualise the detected “niches”. We can see that we have achieved a coarse partioning of the cells into crypt top, mid-crypt and crypt-base regions, as well as segmenting out follicles and sub-mucosal stroma.
How would you tweak the above approach to generate more or less granular niches?
ImageDimPlot(seurat, group.by = "Niches")
Warning: No FOV associated with assay 'NEIGHBOURHOOD100', using global default FOV
We can tabulate our detected niches with predicted cell type labels (or clusters) to visualise enrichment of different cell types across spatial niches.
For example, as could be expected, T-Cells and B-Cells enrich in Niche 2 (follicular).
comp <- table(seurat$Niches, seurat$predicted.id)
pheatmap(comp, scale="row")
comp <- table(seurat$Niches, seurat$SCT_snn_res.0.7)
pheatmap(comp, scale="row")
Detecting Adjacency Dependent Gene Expression Differences
What if we want to know how expression profiles of specific cell types change based on what they’re adjacent to? We can use the exactly same approach, except for each cell neighbourhood, we only aggregate the expression of cell types of interest.
So now for each cell, our matrix is the sum of expression values of nearby specific cell types.
As an example, here we will ask how expression of epithelial cells changes depending on what they are adjacent to. For simplicity, here we will use predicted cell type IDs from single cell reference data.
How would you select a different group of cells?
neighbors <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 10)
Computing nearest neighbor graph
Computing SNN
neighbors$nn <- neighbors$nn[coords$cell, coords$cell]
diag(neighbors$nn) <- 0 # dont count transcriptome of the cell itself, just neighbours?
mt <- LayerData(seurat, layer = "counts", assay = "XENIUM")[, coords$cell]
mt[, seurat$predicted.id != "Epithelium"] <- 0 # Zero counts for cells that are not of interest
sum_mtx <- as.matrix(neighbors$nn %*% t(mt))
As before, we store the aggregated matrix as a new assay, except here we further filter out any cells that do not have any epithelial cells nearby.
seurat[["EPI"]] <- CreateAssayObject(t(sum_mtx[rowSums(sum_mtx) > 0, ]))
DefaultAssay(seurat) <- "EPI"
seurat <- NormalizeData(seurat)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
So, how does gene expression of epithelial cells change depending on whether they are in proximity to macrophages (cluster 9), or myofibroblasts (cluster 0).
First, lets visualise the spatial distribution of these cells:
Idents(seurat) <- "SCT_snn_res.0.7"
ImageDimPlot(seurat, cells = WhichCells(seurat, expression=SCT_snn_res.0.7 %in% c( 9, 0)))
Warning: No FOV associated with assay 'EPI', using global default FOV
We can use FindMarkers function to identify differences between adjacent epithelial cells for these groups, if we run it on the aggregated EPI matrix:
markers <- FindMarkers(seurat, 0, 9)
markers
Let’s visualise some of the top hits - we can see they separate nicely in epithelial cells:
VlnPlot(seurat, c("CA7", "SMOC2"), idents = c(0, 9), assay = "EPI", pt.size = .1, alpha = 0.5)
As a sanity check, if we visualise actual, not epi-adjacent expression in myofibroblasts and macrophages, we can see that there’s hardly any expression at all. So, we are picking up epithelial signal
VlnPlot(seurat, c("CA7", "SMOC2"), idents = c(0, 9), assay = "SCT", pt.size = .1, alpha = 0.5)
And a visual check
DefaultAssay(seurat) <- "SCT"
ImageFeaturePlot(seurat, "CA7") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
ImageFeaturePlot(seurat, "SMOC2") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Finally, lets save all of our analysis as an RDS object
saveRDS(seurat, file="colon_in_situ.RDS")
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /package/R-base/4.3.0/lib/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
[6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/London
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] geojsonR_1.1.1 spdep_1.3-5 sf_1.0-16 spData_2.3.1 pheatmap_1.0.12
[6] readr_2.1.4 wrMisc_1.15.1 corrplot_0.92 MuSiC_1.0.0 TOAST_1.14.0
[11] quadprog_1.5-8 limma_3.56.2 EpiDISH_2.16.0 nnls_1.5 CARD_1.1
[16] clustree_0.5.1 ggraph_2.1.0 scCustomize_2.1.2 SpatialExperiment_1.10.0 SpotClean_1.7.1
[21] DropletUtils_1.20.0 SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2 Biobase_2.60.0 GenomicRanges_1.52.0
[26] GenomeInfoDb_1.36.2 IRanges_2.34.1 S4Vectors_0.38.1 BiocGenerics_0.46.0 MatrixGenerics_1.12.3
[31] matrixStats_1.0.0 ggplot2_3.5.1 Seurat_5.1.0 SeuratObject_5.0.2 sp_2.0-0
loaded via a namespace (and not attached):
[1] R.methodsS3_1.8.2 vroom_1.6.3 tiff_0.1-12 locfdr_1.1-8 goftest_1.2-3
[6] HDF5Array_1.28.1 vctrs_0.6.3 spatstat.random_3.1-5 digest_0.6.33 png_0.1-8
[11] corpcor_1.6.10 shape_1.4.6 MCMCpack_1.7-0 proxy_0.4-27 Rfast2_0.1.5.2
[16] registry_0.5-1 ggrepel_0.9.3 deldir_1.0-9 parallelly_1.36.0 RcppML_0.3.7
[21] Rnanoflann_0.0.3 magick_2.7.5 MASS_7.3-58.4 reshape2_1.4.4 httpuv_1.6.11
[26] foreach_1.5.2 withr_2.5.0 ggrastr_1.0.2 xfun_0.40 ggfun_0.1.2
[31] ellipsis_0.3.2 survival_3.5-5 s2_1.1.7 ggbeeswarm_0.7.2 janitor_2.2.0
[36] MatrixModels_0.5-2 zoo_1.8-12 GlobalOptions_0.1.2 gtools_3.9.4 pbapply_1.7-2
[41] R.oo_1.25.0 GGally_2.2.1 rematch2_2.1.2 promises_1.2.1 httr_1.4.7
[46] globals_0.16.2 fitdistrplus_1.1-11 rhdf5filters_1.12.1 rhdf5_2.44.0 rstudioapi_0.15.0
[51] units_0.8-5 miniUI_0.1.1.1 generics_0.1.3 concaveman_1.1.0 fields_15.2
[56] zlibbioc_1.46.0 polyclip_1.10-4 RcppZiggurat_0.1.6 GenomeInfoDbData_1.2.10 xtable_1.8-4
[61] stringr_1.5.0 doParallel_1.0.17 evaluate_0.21 S4Arrays_1.2.1 Rfast_2.1.0
[66] hms_1.1.3 irlba_2.3.5.1 colorspace_2.1-0 hdf5r_1.3.10 ROCR_1.0-11
[71] reticulate_1.31 spatstat.data_3.0-1 magrittr_2.0.3 lmtest_0.9-40 snakecase_0.11.1
[76] later_1.3.1 viridis_0.6.4 lattice_0.21-8 glmGamPoi_1.12.2 spatstat.geom_3.2-4
[81] NMF_0.27 future.apply_1.11.0 SparseM_1.81 scattermore_1.2 scuttle_1.10.2
[86] cowplot_1.1.1 RcppAnnoy_0.0.21 class_7.3-21 pillar_1.9.0 nlme_3.1-162
[91] iterators_1.0.14 gridBase_0.4-7 compiler_4.3.0 beachmat_2.16.0 RSpectra_0.16-1
[96] stringi_1.7.12 tensor_1.5 lubridate_1.9.2 plyr_1.8.8 crayon_1.5.2
[101] abind_1.4-5 locfit_1.5-9.8 graphlayouts_1.0.0 bit_4.0.5 dplyr_1.1.3
[106] codetools_0.2-19 bslib_0.5.1 e1071_1.7-14 paletteer_1.6.0 GetoptLong_1.0.5
[111] plotly_4.10.2 mime_0.12 splines_4.3.0 circlize_0.4.15 Rcpp_1.0.11
[116] fastDummies_1.7.3 quantreg_5.97 sparseMatrixStats_1.12.2 prismatic_1.1.2 knitr_1.43
[121] utf8_1.2.3 clue_0.3-64 listenv_0.9.0 checkmate_2.2.0 DelayedMatrixStats_1.22.6
[126] tibble_3.2.1 Matrix_1.6-5 tzdb_0.4.0 tweenr_2.0.2 pkgconfig_2.0.3
[131] tools_4.3.0 cachem_1.0.8 viridisLite_0.4.2 DBI_1.1.3 fastmap_1.1.1
[136] rmarkdown_2.24 scales_1.3.0 grid_4.3.0 readbitmap_0.1.5 pbmcapply_1.5.1
[141] ica_1.0-3 sass_0.4.7 patchwork_1.2.0 coda_0.19-4 ggprism_1.0.5
[146] ggstats_0.6.0 dotCall64_1.0-2 RANN_2.6.1 farver_2.1.1 tidygraph_1.2.3
[151] scatterpie_0.2.1 wk_0.9.2 yaml_2.3.7 bmp_0.3 cli_3.6.1
[156] purrr_1.0.2 leiden_0.4.3 lifecycle_1.0.3 uwot_0.1.16 presto_1.0.0
[161] backports_1.4.1 ggcorrplot_0.1.4.1 BiocParallel_1.34.2 timechange_0.2.0 gtable_0.3.4
[166] rjson_0.2.21 ggridges_0.5.4 progressr_0.14.0 parallel_4.3.0 jsonlite_1.8.7
[171] edgeR_3.42.4 RcppHNSW_0.4.1 bitops_1.0-7 bit64_4.0.5 Rtsne_0.16
[176] spatstat.utils_3.0-3 RcppParallel_5.1.8 jquerylib_0.1.4 dqrng_0.3.1 R.utils_2.12.2
[181] lazyeval_0.2.2 shiny_1.7.5 htmltools_0.5.6 sctransform_0.4.1 glue_1.6.2
[186] spam_2.9-1 XVector_0.40.0 RCurl_1.98-1.12 mcmc_0.9-8 classInt_0.4-10
[191] jpeg_0.1-10 gridExtra_2.3 boot_1.3-28.1 igraph_1.5.1 R6_2.5.1
[196] tidyr_1.3.0 forcats_1.0.0 labeling_0.4.3 cluster_2.1.4 rngtools_1.5.2
[201] Rhdf5lib_1.22.0 DelayedArray_0.26.7 tidyselect_1.2.0 vipor_0.4.5 maps_3.4.1
[206] ggforce_0.4.1 future_1.33.0 munsell_0.5.0 KernSmooth_2.23-20 data.table_1.14.8
[211] ComplexHeatmap_2.16.0 htmlwidgets_1.6.2 RColorBrewer_1.1-3 rlang_1.1.4 spatstat.sparse_3.0-2
[216] spatstat.explore_3.2-1 Cairo_1.6-1 fansi_1.0.4 beeswarm_0.4.0